The dataset used for this task is available on Kaggle (“Telco
Customer Churn,” n.d.) and contains information from
\(7043\) clients of a telco company,
distributed across \(21\) different
variables including gender, MonthlyCharges,
and PhoneService, among others.
In the business sector, understanding the number of customers gained and lost is crucial for maximizing profitability. Thus, analyzing whether a client is considering switching their services to another company can help the company evaluate and create new strategies. This dataset is designed for predicting customer churn, storing the final decision of each client.
The information contained in the dataset can be grouped into different areas:
Personal information.
customerID; A unique identifier for the
client.
gender; The customer’s gender, Female or
Male.
SeniorCitizen; Binary variable indicating if the
customer is older than 65 (Yes) or not (No).
Partner; Binary variable which stores whether the
customer has a partner (Yes) or not (No).
Dependents; Binary variable that indicates if the
client lives with more people (Yes) or not (No).
Services subscribed.
tenure; The number of months the customer has been
with the company by the end of the third quarter (3Q).
PhoneService; Whether the customer subscribes to
home phone service (Yes or No).
MultipleLines; Whether the customer has more than
one telephone line (Yes or No). This service is only available if the
user has PhoneService active.
InternetService; Type of subscription to Internet
service (No, DSL, Fiber Optic, or Cable).
The following extra services are only available for users who have
InternetService active (the company does not charge any
additional fee).
OnlineSecurity; Whether the customer subscribes to
an online security service (Yes or No).
OnlineBackup; Whether the customer subscribes to an
online backup service (Yes or No).
DeviceProtection; Whether the customer subscribes to
a device protection plan (Yes or No).
TechSupport; Whether the customer subscribes to a
technical support plan from the company with reduced wait times (Yes or
No).
StreamingTV; Whether the customer uses the Internet
service to stream television programming (Yes, No).
StreamingMovies; Whether the customer uses the
Internet service to stream movies (Yes, No).
Account information.
Contract; The type of contract that the customer
has signed (Month-to-Month, One Year, Two Year).
PaperlessBilling; Whether the customer has chosen
paperless billing (Yes or No).
PaymentMethod; Customer payment method (Electronic
Check, Mailed Check, Bank Transfer, or Credit Card).
MonthlyCharges; Customer monthly charges for all
services from the company.
TotalCharges; Total expenses calculated to the end
of the 3Q (third quarter).
Target variable.
Churn; Whether the customer left the company this
quarter (Yes) or stayed with the company (No).
The objective of this assignment is to predict the Churn
variable, which represents whether a client will remain with or leave
the company’s services within the coming period. This
prediction task is a binary classification problem where the goal is to
forecast one of two possible outcomes: churn yes (leave the company)
or churn no (stay with the company’s services).
This valuable information can provide insights that allow the company to form better strategies and enhance customer retention by optimizing its service offering, as well as increase its profitability.
This section focuses on preparing the dataset for exploratory data analysis (EDA) and for applying the computational models. This includes loading the necessary libraries, handling missing values, encoding categorical variables, and splitting the dataset into training and testing subsets, among other tasks.
Load libraries. Several libraries are needed for the project to work properly.
library(skimr) # Summary statistics
library(tidyverse) # Include ggplot2, dplyr, among others
library(mice) # Imputation missing data
library(VIM) # Imputation graph
library(caret) # Models
library(GGally) # Plots of the continuous variables
library(pander) # For making tables
library(gridExtra) # For grid.arrange
library(reshape2) # Data reshaping (melt)
library(corrplot) # Correlation matrix
library(patchwork)
library(knitr)
library(psych) # Compute the correlation matrix
library(polycor) # hetcor(): correlation matrix for mixed data
library(pROC) # ROC curve analysis
library(plotly) # Interactive plot
Load dataset. Import the dataset.
data <- read.csv("Telco-Customer-Churn.csv", header = TRUE, sep = ",")
First look at the dataset. Brief overview of the dataset for further preprocessing steps.
glimpse(data)
## Rows: 7,043
## Columns: 21
## $ customerID <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "7795-CFOCW…
## $ gender <chr> "Female", "Male", "Male", "Male", "Female", "Female",…
## $ SeniorCitizen <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Partner <chr> "Yes", "No", "No", "No", "No", "No", "No", "No", "Yes…
## $ Dependents <chr> "No", "No", "No", "No", "No", "No", "Yes", "No", "No"…
## $ tenure <int> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 58, 49, 2…
## $ PhoneService <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ MultipleLines <chr> "No phone service", "No", "No", "No phone service", "…
## $ InternetService <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "Fiber opt…
## $ OnlineSecurity <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", "Yes", "…
## $ OnlineBackup <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", "No", "N…
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", "No", "Y…
## $ TechSupport <chr> "No", "No", "No", "Yes", "No", "No", "No", "No", "Yes…
## $ StreamingTV <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "No", "Ye…
## $ StreamingMovies <chr> "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes…
## $ Contract <chr> "Month-to-month", "One year", "Month-to-month", "One …
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes", "No", …
## $ PaymentMethod <chr> "Electronic check", "Mailed check", "Mailed check", "…
## $ MonthlyCharges <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.7…
## $ TotalCharges <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820.50, 1949…
## $ Churn <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", "No", "Y…
Some variables are removed because they contribute little to the
model’s predictions. For instance, customerID is
just the unique identifier for each customer. Likewise, the
PaperlessBilling parameter carries little meaning
nowadays: in the current digital era, most
customers opt for online billing anyway. Thus, these columns are excluded
from the analysis.
# Remove unnecessary variables
data <- data %>% dplyr::select(-c(customerID, PaperlessBilling))
By analyzing the dataset, we can see that certain variables are closely related; therefore, some modifications are applied.
Combine PhoneService and
MultipleLines. The first variable indicates whether the
client has subscribed to home phone service. Only when the client
has hired this service can they choose between having \(1\) phone line
(MultipleLines=No) or more than \(1\) (MultipleLines=Yes). This
relationship leads us to three possible scenarios:
No phone service. Customers who have not
subscribed to phone service (PhoneService=No and
MultipleLines=“No phone service”).
Single phone line. Customers with phone service
and a single line (PhoneService=Yes and
MultipleLines=No).
Multiple phone lines. Customers with phone
service and multiple lines (PhoneService=Yes and
MultipleLines=Yes).
To simplify this, a new variable
(PhoneServiceType) is introduced with three categories corresponding to
the above scenarios:
NoPhone: if there is no phone service.
PhoneSingle: if there is phone service with a single
line.
PhoneMultiple: if there is phone service with
multiple lines.
InternetService and its associated variables. A
similar situation happens with InternetService and its
related services (OnlineSecurity, StreamingTV,
among others). Clients without internet service cannot sign up for these
additional services, which are denoted as “Yes”, “No”, or “No internet
service” in the dataset. We can simplify this by treating “No
internet service” as equivalent to “No”, thereby converting these
variables into binary classes.
Note. For both situations, it has been computationally verified that the relationships between these variables are consistent, which rules out nonsensical combinations.
## Unique values of each variable ----------------------------------------------
cat_vars <- sapply(data, is.character) # Select qualitative variables
lapply(data[, cat_vars], unique) # Print the unique values of each
## PhoneService and MultipleLines ----------------------------------------------
# Subset with clients where PhoneService is "No"
subset_no_Phone <- data[data$PhoneService == "No", ]
# Check MultipleLines is set to "No phone service"
if (all(subset_no_Phone$MultipleLines == "No phone service")) {
cat("In the subset where PhoneService is 'No',
MultipleLines is set to 'No phone service' only.\n")
} else {
cat("In the subset where PhoneService is 'No',
MultipleLines is not exclusively set to 'No phone service'.\n")
}
# Combine PhoneService and MultipleLines on a new variable (PhoneServiceType)
# Convert into a factor
data <- data %>%
mutate(PhoneServiceType = case_when(
MultipleLines == "No phone service" ~ "NoPhone",
MultipleLines == "No" ~ "PhoneSingle",
MultipleLines == "Yes" ~ "PhoneMultiple"
)) %>%
mutate(PhoneServiceType = factor(PhoneServiceType, levels = c("NoPhone", "PhoneSingle", "PhoneMultiple")))
# Remove original variables
data <- dplyr::select(data, -PhoneService, -MultipleLines)
## Consequences Internet service -----------------------------------------------
# Subset with clients where InternetService is "No"
subset_no_Internet <- data[data$InternetService == "No", ]
vars_to_check <- c("OnlineSecurity", "OnlineBackup", "DeviceProtection",
"TechSupport", "StreamingTV", "StreamingMovies")
# Check that its associated variables are set to "No internet service"
for (variable in vars_to_check) {
unique_values <- unique(subset_no_Internet[[variable]])
cat("Unique values for", variable, ":", unique_values)
cat("\n")
}
# Reduce to a binary variable ("No internet service" -> "No")
for (variable in vars_to_check) {
data[[variable]] <- ifelse(data[[variable]] == "No internet service", "No", data[[variable]])
}
In the dataset, a small number of observations of the variable
TotalCharges contain missing values: only \(0.156\)% of the customers. To handle these
missing values, we have several options:
Delete these rows; Given the low number of missing values, deleting these observations will not significantly affect the dataset.
Set NA’s to \(0\); Since these observations correspond to new clients, we can assume not enough time has passed to compute this value, so setting it to zero can be a good approach.
Imputation; We can impute the missing data using
techniques provided by the R library mice (van Buuren and
Groothuis-Oudshoorn 2011). By default, mice generates \(5\) imputations; however, for this project
we will only consider the first one. We will use the ‘pmm’ method
(predictive mean matching), which fills each missing entry with an
observed value whose predicted mean is close to that of the missing
case. After generating the imputed data, we can complete the
missing values using the complete() function, which fills
in the missing values with those from the first imputation.
To check that there is no missing data anymore, we use a graph from
library(VIM) (Kowarik and Templ 2016), which displays
the percentage of NA’s.
Also, to assess whether the imputed data adequately reflect the
distributions of the original data, we can use the
densityplot() function. This function shows the marginal
distribution of the observed data in blue and, in red, the
\(5\) densities for the predictor with
initially missing data.
## MICE Imputation -------------------------------------------------------------
meth <- rep("", ncol(data)) # Set all methods to "" (no imputation)
names(meth) <- names(data)  # Name the vector so columns can be addressed
meth["TotalCharges"] <- "pmm" # Impute TotalCharges with predictive mean matching
mice_model <- mice(data,
  m = 5, maxit = 5, method = meth,
  seed = 1234, print = FALSE
)
data <- complete(mice_model, 1) # Complete the data with the first imputation
As we can observe in the density plot, the imputed distribution does not match the observed one as closely as it should. Hence, setting these observations to \(0\) or deleting them might be a better option. However, given the low proportion of missing values in the dataset, we will continue the project with this imputation technique.
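For reference, the two simpler alternatives (deleting the rows or setting the missing charges to zero) could be sketched in base R as follows; the toy data frame here is hypothetical, standing in for the Telco data.

```r
# Toy data frame standing in for the Telco dataset (hypothetical values)
toy <- data.frame(
  tenure       = c(1, 0, 5),
  TotalCharges = c(29.85, NA, 350.10)
)

# Option 1: delete the rows where TotalCharges is missing
toy_dropped <- toy[!is.na(toy$TotalCharges), ]

# Option 2: set the missing TotalCharges to 0 (new clients have accumulated nothing yet)
toy_zero <- toy
toy_zero$TotalCharges[is.na(toy_zero$TotalCharges)] <- 0
```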
Once we have the dataset ready, the goal of this section is to obtain a dataset that our statistical models know how to handle. This includes turning the categorical variables into factors.
Binary variables. There are a total of \(11\) binary variables: gender,
SeniorCitizen, Partner,
Dependents, OnlineSecurity,
OnlineBackup, DeviceProtection,
TechSupport, StreamingTV,
StreamingMovies and Churn. All these variables
have been transformed into factors with two possible outcomes, “No” or
“Yes”.
Multi-state variables. There are \(4\) variables of this kind:
InternetService, Contract,
PaymentMethod, and PhoneServiceType. As
before, I will create factor levels for each possible outcome. Notice
that this strategy can be followed due to the absence of any kind of
“natural order”.
## Binary encoding -------------------------------------------------------------
binary_variables <- c(
"Partner", "Dependents", "OnlineSecurity", "OnlineBackup",
"DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "Churn"
)
for (var in binary_variables) {
data[[var]] <- factor(data[[var]], levels = c("No", "Yes"))
}
data$gender <- factor(data$gender)
levels(data$gender) # Check the levels
unique(data$SeniorCitizen) # Already encoded as 1's and 0's
data$SeniorCitizen <- factor(data$SeniorCitizen,
levels = c(0, 1),
labels = c("No", "Yes"))
levels(data$SeniorCitizen)
## Multi-state variables -------------------------------------------------------
# Rename levels of all variables as needed to avoid misunderstandings later
# InternetService to a factor and rename levels
data$InternetService <- factor(data$InternetService,
levels = c("DSL", "Fiber optic", "No"),
labels = c("DSL", "FiberOptic", "No"))
levels(data$InternetService) # Check the levels
# Contract to a factor and rename levels
data$Contract <- factor(data$Contract,
levels = c("Month-to-month", "One year", "Two year"),
labels = c("MonthToMonth", "OneYear", "TwoYear"))
levels(data$Contract) # Check the levels
# PaymentMethod to a factor and rename levels
data$PaymentMethod <- factor(data$PaymentMethod,
levels = c("Electronic check",
"Mailed check",
"Bank transfer (automatic)",
"Credit card (automatic)"),
labels = c("ElectronicCheck",
"MailedCheck",
"BankTransfer",
"CreditCard")
)
levels(data$PaymentMethod) # Check the levels
The final step before applying the statistical tools consists of dividing the dataset into two subsets. The training set takes \(80\)% of the data and is used to train the models; this subset allows the model to learn patterns and relationships within the data. The testing set (\(20\)% of the data), on the other hand, consists of unseen data used to evaluate the performance of the trained model.
set.seed(1234) # For reproducibility
# Split into training (80%) and testing (20%) set
index <- createDataPartition(data$Churn, p=0.8, list=FALSE)
training <- data[ index,]
testing <- data[-index,]
nrow(training) # Number of observation for training
nrow(testing) # Number of observation for testing
The aim is to uncover patterns, detect outliers, and test assumptions through visual and quantitative methods.
The ggpairs() function from the GGally package allows us to
create a scatter plot matrix for these three variables. This matrix
includes histograms for each variable along the diagonal, showcasing
their distributions. The lower triangle displays scatter plots to explore
the relationships between variable pairs, while the upper triangle shows
the correlation coefficients, providing a quantitative measure of those
relationships. This combined view lets us identify the shape of each
distribution, tendencies, and outliers, among other things, thanks to the
different views in one grid.
# Select continuous variables (include target)
continuous_data <- training[, c("tenure", "MonthlyCharges", "TotalCharges", "Churn")]
# Create scatter plot matrix
ggpairs(continuous_data,
aes(fill="pink"),
lower = list(continuous = "points", combo = "box_no_facet"),
upper = list(continuous = "cor"),
diag = list(continuous = "barDiag"),
title = "Scatter plot matrix"
)
tenure. The histogram shows a bimodal distribution, with peaks at the extremes of the range. This implies that most customers are either new (low peak) or very loyal (high peak), with fewer clients in the mid-range.
MonthlyCharges. The distribution displays a significant peak at the beginning, suggesting a large group of clients who pay around \(25\)$ per month. The distribution then becomes more uniform for charges between \(60\)$ and \(120\)$.
TotalCharges. It follows a right-skewed distribution, which means most customers are subscribed to services that require paying smaller amounts of money; most clients accumulate no more than \(1250 \$\). To move toward a normal distribution, I will apply a log transformation to this variable.
General insights.
This analysis reveals non-normal distributions for the numeric
variables; because some of the statistical methods used require it, the
data will be standardized. In addition, it is worth mentioning the high
correlation between tenure and TotalCharges
(correlation coefficient of \(0.823\)).
Finally, it is also notable that these two parameters present some
outliers.
Relation with the output.
Exploring deeper, customers with lower total
charges are more likely to leave (Churn=Yes). Furthermore,
a second insight concerns tenure: the density plot leads us
to the conclusion that clients tend to leave within the first year.
This could be because they find better alternatives or due to
dissatisfaction with the company.
Data transformation code.
# Numeric transformation
# TRAINING dataset -------------------------------------------------------------
training_transf <- training
# Log-transformation to solve TotalCharges right skewness
training_transf$TotalCharges <- log(training_transf$TotalCharges)
# Standardizing the variables
training_transf[c("tenure", "MonthlyCharges", "TotalCharges")] <- scale(training_transf[c("tenure", "MonthlyCharges", "TotalCharges")])
# TESTING dataset --------------------------------------------------------------
testing_transf <- testing
# Log-transformation to solve TotalCharges right skewness
testing_transf$TotalCharges <- log(testing_transf$TotalCharges)
# Standardizing the variables
testing_transf[c("tenure", "MonthlyCharges", "TotalCharges")] <- scale(testing_transf[c("tenure", "MonthlyCharges", "TotalCharges")])
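One caveat worth noting: scale() above standardizes the test set with its own statistics. A common alternative, sketched here with hypothetical vectors (not the actual columns), reuses the training mean and standard deviation so both subsets share one scale:

```r
# Hypothetical training/test values for one numeric column
train_x <- c(10, 20, 30, 40)
test_x  <- c(15, 35)

mu    <- mean(train_x) # training mean
sigma <- sd(train_x)   # training standard deviation

# Standardize the TEST values with the TRAINING parameters
test_scaled <- (test_x - mu) / sigma
```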
Binary variables.
In the table below, we observe that gender and
Partner are equally represented in the dataset, with
both categories close to a \(50\%\)
split. Meanwhile, only approximately \(16\)% of the observations belong to senior
citizens. This information aligns with demographic expectations.
| Category | SeniorCitizen | Partner | Dependents | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies |
|---|---|---|---|---|---|---|---|---|---|
| No | 4716 | 2921 | 3927 | 4022 | 3693 | 3724 | 4016 | 3492 | 3442 |
| Yes | 920 | 2715 | 1709 | 1614 | 1943 | 1912 | 1620 | 2144 | 2194 |
| gender | n |
|---|---|
| Female | 2787 |
| Male | 2849 |
The following plots aim to show the churn distribution across each
binary variable. We can see that customers who leave do not differ by
gender or by whether they have a partner. However, people who
are subscribed to extra services such as online security, online backup,
device protection, and tech support, or who have dependents, tend to stay
with the company (Churn=No) most of the time.
Categorical variables.
The next step consists of exploring the \(4\) multi-state variables:
InternetService, Contract,
PaymentMethod, and PhoneServiceType. Some
insights from these plots are that a majority of customers subscribe to
both internet and phone services, suggesting they are considered
essential. However, there is no favorite payment method; the choices are
more or less equally distributed. Lastly, the most popular choice among
clients is to renew the contract month by month.
Finally, if we study the relationship of these variables with the target, we see that people with two-year contracts tend to be more loyal clients (less likely to switch to another company), while customers paying by electronic check have a higher probability of abandoning the company.
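Claims like these can be quantified with a per-category churn rate; a minimal base-R sketch on hypothetical counts (not the actual dataset):

```r
# Hypothetical churn labels per contract type (toy data, not the real counts)
contract <- rep(c("MonthToMonth", "OneYear", "TwoYear"), times = c(6, 3, 3))
churn    <- c("Yes", "Yes", "Yes", "No", "No", "No",  # month-to-month clients
              "No", "No", "Yes",                      # one-year clients
              "No", "No", "No")                       # two-year clients

# Row-wise churn proportion for each contract type
churn_rate <- prop.table(table(contract, churn), margin = 1)[, "Yes"]
round(churn_rate, 2)
```

On the real data the same two lines, applied to `training$Contract` and `training$Churn`, back the loyalty pattern described above.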
The heatmap represents the strength and direction of the relationships between the variables in the dataset, where the color intensity indicates the magnitude of the correlation. Dark colors, either blue (positive) or red (negative), suggest a stronger relationship between variables. A positive coefficient means that an increase (or decrease) in one variable is associated with an increase (or decrease) in the other; conversely, a negative correlation indicates that as one variable increases, the other decreases. Colors in the middle of the scale are interpreted as weak correlations.
To deal with mixed data types, the hetcor() function
from the polycor package in R is
used. It enables the computation of correlations across numeric, binary,
and multi-state variables, as long as the qualitative variables are
represented as factors.
# Works for mixed variables: binary, multi-state and numeric
# Qualitative variables have to be factors
cor_matrix <- hetcor(training_transf)
cor_values <- cor_matrix$correlations
corrplot(cor_values, method="color", type="upper", order="hclust",
tl.cex=0.7, tl.col="black", tl.srt=45)
Analytical exploration
Coefficients higher than \(0.7\) or lower than \(-0.7\) are considered strong. In this dataset we encounter \(5\) strong relationships, shown in the table below.
Most of these relationships are somewhat obvious. For instance, if a
client spends more time with the company’s services, their accumulated
charges will be higher too. However, we can extract an interesting insight:
customers who subscribe to one streaming service are likely to subscribe
to the other (StreamingMovies-StreamingTV =
\(0.7526\)).
cor_flat <- as.data.frame(as.table(cor_values))
cor_flat <- cor_flat[order(-abs(cor_flat$Freq)),]
# Filter correlations: >0.7 or <-0.7
strong_correlations <- subset(cor_flat, abs(Freq) > 0.7 & Var1 != Var2)
# Remove duplicates rows
strong_correlations$pairID <- apply(strong_correlations[, c("Var1", "Var2")],
1,
function(x) paste(sort(x), collapse = "-"))
strong_correlations <- strong_correlations[!duplicated(strong_correlations$pairID),]
strong_correlations$pairID <- NULL # Remove ID
# Add sign
strong_correlations$Sign <- ifelse(strong_correlations$Freq > 0, "+", "-")
pander(strong_correlations)
| Var1 | Var2 | Freq | Sign | |
|---|---|---|---|---|
| 88 | TotalCharges | tenure | 0.8338 | + |
| 195 | MonthlyCharges | StreamingTV | 0.7629 | + |
| 213 | MonthlyCharges | StreamingMovies | 0.7549 | + |
| 192 | StreamingMovies | StreamingTV | 0.7526 | + |
| 85 | Contract | tenure | 0.7143 | + |
To see the distribution of our target variable (Churn),
we first look at the raw counts and then at the percentages to
understand the proportion of customers who leave versus those who stay
with the company’s services.
# Check counts of Churn
pander(table(data$Churn))
| No | Yes |
|---|---|
| 5174 | 1869 |
# Results in percentage
pander(prop.table(table(data$Churn)) * 100)
| No | Yes |
|---|---|
| 73.46 | 26.54 |
The dataset is slightly imbalanced in the distribution of the
Churn variable: there are more observations from clients
who decided to stay with the company’s services (Churn=No)
than from those who decided to leave. Although the imbalance exists, it
is relatively minor.
After applying a sampling technique to address this issue, no improvement was observed in the models’ performance; in fact, the results were worse than with the original data proportions. This suggests that the imbalance may not impact the models’ accuracy. Therefore, I have decided not to apply any balancing method when computing the statistical tools. However, code for performing upsampling is provided in the Annex section.
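The idea behind such an upsampling experiment can be sketched in base R by resampling the minority class with replacement; this is only a toy illustration of the principle, not the code used in the Annex:

```r
set.seed(1234) # reproducibility
# Toy imbalanced target, mimicking the Churn proportions (hypothetical labels)
y <- factor(c(rep("No", 8), rep("Yes", 2)))

minority  <- which(y == "Yes")
n_extra   <- sum(y == "No") - length(minority)         # copies needed to balance
extra_idx <- sample(minority, n_extra, replace = TRUE) # resample minority cases

y_balanced <- c(as.character(y), as.character(y[extra_idx]))
table(y_balanced) # both classes now have 8 observations
```

In practice the resampled indices would be used to duplicate whole rows of the training data frame, not just the labels.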
The objective of this section is to evaluate the performance of several classification models on the binary classification task. The goal is to predict whether the customer intends to stop using the company’s services. For each technique, we will compute the confusion matrix, a tool that indicates how many instances were correctly classified and how many were misclassified.
First, the dataset is divided into predictors and the target variable.
# Divide into predictors and target both subsets
X_train <- training_transf %>% dplyr::select(-Churn)
y_train <- training_transf$Churn
X_test <- testing_transf %>% dplyr::select(-Churn)
y_test <- testing_transf$Churn
n_test <- dim(testing_transf)[1]
To ensure a robust evaluation, a common training control parameter is configured: \(5\) repeats of \(10\)-fold cross-validation.
# Set up train control -> 5 repeats of 10-fold cross validation
train_ctrl <- trainControl(method = "repeatedcv",
repeats = 5,
number = 10)
Finally, it is worth mentioning that for every model we generate a
plot of the posterior probabilities of churn for each observation in
the test set. Each point represents one prediction of the model under
evaluation: blue points are customers who decided to stay
(Churn=No) and red points are those who will probably
abandon the company (Churn=Yes). The dashed horizontal
line at \(0.5\) serves as the decision
threshold, indicating the probability level at which the model switches
from predicting that a customer will not churn to predicting that they
will churn. This gives a visual representation of how the model is
predicting.
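As a minimal illustration of that thresholding rule (with made-up probabilities, not the models’ actual output):

```r
# Hypothetical posterior probabilities P(Churn = "Yes") for five test customers
post_prob_yes <- c(0.12, 0.48, 0.50, 0.73, 0.91)

# The dashed line at 0.5 corresponds to this classification rule
pred_class <- ifelse(post_prob_yes > 0.5, "Yes", "No")
pred_class
```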
QDA stands for Quadratic Discriminant Analysis. This model assumes a Gaussian distribution of the predictors for each class, but allows each class its own covariance matrix.
# Train the QDA model using caret
model_QDA <- train(Churn ~ .,
data = training_transf,
method = "qda",
trControl = train_ctrl) # Use the defined train control
# Predictions
predictions_QDA <- predict(model_QDA, X_test)
# head(predictions_QDA)
# Posterior probabilities
post_prob_QDA <- predict(model_QDA, X_test, type = "prob")
# head(post_prob_QDA)
# Performance measures
confMat_QDA <- table(predictions_QDA, testing_transf$Churn)
addmargins(confMat_QDA)
##
## predictions_QDA No Yes Sum
## No 797 86 883
## Yes 237 287 524
## Sum 1034 373 1407
## [1] "------ Testing measures ------"
## [1] "Error for QDA: 0.229566"
## [1] "Accuracy for QDA: 0.770434"
# Plot
ggplot(data = testing_transf, aes(x = seq_along(testing_transf$Churn))) +
geom_point(aes(y = post_prob_QDA[,"Yes"], color = Churn)) +
theme_light(base_size = 14) +
scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
xlab("Individual") +
ylab("Probability of Churn with QDA") +
geom_hline(yintercept = 0.5, linetype = "dashed")
The confusion matrix and the plot represent the same idea in different ways. In the plot, red points below and blue points above the threshold line are misclassifications; in the confusion matrix, these clients appear in the off-diagonal cells.
We can conclude that the model:
It is good at identifying the customers who will stay and actually do stay with the company (\(90.26\)% right).
The worst mistake is when the model predicts the client will stay but they actually leave; with QDA this happens \(86\) times.
The other possible error is when it predicts the client will leave but they actually continue with the company (\(237\) wrong identifications).
The model is right only \(54.77\)% of the time when predicting that a client will leave.
Furthermore, the kappa value is not too high.
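The percentages quoted for QDA follow directly from the confusion-matrix counts above; a quick arithmetic check:

```r
# QDA confusion-matrix counts copied from the output above
# (rows = predicted, columns = actual)
no_no  <- 797; no_yes  <- 86   # predicted "No"
yes_no <- 237; yes_yes <- 287  # predicted "Yes"

acc      <- (no_no + yes_yes) / (no_no + no_yes + yes_no + yes_yes) # overall accuracy
prec_no  <- no_no  / (no_no + no_yes)    # share of correct "will stay" predictions
prec_yes <- yes_yes / (yes_no + yes_yes) # share of correct "will leave" predictions
round(c(accuracy = acc, precision_no = prec_no, precision_yes = prec_yes), 4)
```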
Linear Discriminant Analysis (LDA) is a simplified version of the QDA model which introduces some bias to reduce the variance.
# Train the LDA model using caret
model_LDA <- train(Churn ~ .,
data = training_transf,
method = "lda",
trControl = train_ctrl) # Use the defined train control
# Predictions
predictions_LDA <- predict(model_LDA, X_test)
# head(predictions_LDA)
# Posterior probabilities
post_prob_LDA <- predict(model_LDA, X_test, type = "prob")
# head(post_prob_LDA)
# Performance measures
confMat_LDA <- table(predictions_LDA, testing_transf$Churn)
addmargins(confMat_LDA)
##
## predictions_LDA No Yes Sum
## No 934 163 1097
## Yes 100 210 310
## Sum 1034 373 1407
## [1] "------ Testing measures ------"
## [1] "Error for LDA: 0.186923"
## [1] "Accuracy for LDA: 0.813077"
# Plot
ggplot(data = testing_transf, aes(x = seq_along(testing_transf$Churn))) +
geom_point(aes(y = post_prob_LDA[,"Yes"], color = Churn)) +
theme_light(base_size = 14) +
scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
xlab("Individual") +
ylab("Probability of Churn with LDA") +
geom_hline(yintercept = 0.5, linetype = "dashed")
This model achieved an accuracy of \(81.3077\)%, improving on the QDA
performance; thus we can conclude it is able to make good predictions
about whether a client will churn or not.
However, compared to QDA, the errors that are worst for the company (predicting a client will stay when they actually leave) have increased from \(86\) to \(163\), while the opposite mistake (predicting a client will leave when they actually stay) has dropped noticeably, from \(237\) to \(100\). As before, the model accurately identifies people who will stay with the company.
Naïve Bayes is useful when the dataset is very large and with multi-class predictors; for this dataset it might not work as well as QDA or LDA. On the other hand, it usually works well when the predictors are independent of one another.
# Train Naive Bayes model
model_NB <- train(x = X_train,
                  y = y_train,
                  method = "naive_bayes",
                  trControl = train_ctrl,
                  tuneGrid = data.frame(laplace = 0.5,
                                        usekernel = FALSE,
                                        adjust = 1)) # adjust only applies when usekernel = TRUE
# Predict on test data
predictions_NB <- predict(model_NB, newdata = X_test)
# head(predictions_NB)
# Posterior probabilities
post_prob_NB <- predict(model_NB, X_test, type = "prob")
# head(post_prob_NB)
# Evaluate the model using confusion matrix
confusionMatrix(predictions_NB, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 846 108
## Yes 188 265
##
## Accuracy : 0.7896
## 95% CI : (0.7674, 0.8107)
## No Information Rate : 0.7349
## P-Value [Acc > NIR] : 1.127e-06
##
## Kappa : 0.4947
##
## Mcnemar's Test P-Value : 4.395e-06
##
## Sensitivity : 0.8182
## Specificity : 0.7105
## Pos Pred Value : 0.8868
## Neg Pred Value : 0.5850
## Prevalence : 0.7349
## Detection Rate : 0.6013
## Detection Prevalence : 0.6780
## Balanced Accuracy : 0.7643
##
## 'Positive' Class : No
##
# Plot
ggplot(data = testing_transf, aes(x = seq_along(testing_transf$Churn))) +
geom_point(aes(y = post_prob_NB[,"Yes"], color = Churn)) +
theme_light(base_size = 14) +
scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
xlab("Individual") +
ylab("Probability of Churn with Naïve Bayes") +
geom_hline(yintercept = 0.5, linetype = "dashed")
In summary, the model achieved an accuracy of \(78.96\)%, which, while not the highest among the models tested, is still reasonably good. It demonstrates a strong ability to correctly identify the clients who will remain with the company (Positive Predictive Value). However, regarding the wrong predictions, it makes \(108\) of the costly mistakes (around the average of the previous models).
Logistic regression is a statistical method for binary prediction problems that also allows categorical predictors, so it looks like a model that can fit this classification task well. However, it must be noted that it is sensitive to outliers in the data.
# Train Logistic Regression classifier
model_LR <- train(Churn ~ .,
                  data = training_transf,
                  method = "glm",
                  family = "binomial", # for binary logistic regression
                  trControl = train_ctrl)
# Predict on test data
predictions_LR <- predict(model_LR, X_test)
# head(predictions_LR)
# Posterior probabilities
post_prob_LR <- predict(model_LR, X_test, type = "prob")
# head(post_prob_LR)
# Evaluate the model using confusion matrix
confusionMatrix(predictions_LR, y_test)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 944 168
## Yes 90 205
##
## Accuracy : 0.8166
## 95% CI : (0.7954, 0.8365)
## No Information Rate : 0.7349
## P-Value [Acc > NIR] : 3.431e-13
##
## Kappa : 0.4957
##
## Mcnemar's Test P-Value : 1.636e-06
##
## Sensitivity : 0.9130
## Specificity : 0.5496
## Pos Pred Value : 0.8489
## Neg Pred Value : 0.6949
## Prevalence : 0.7349
## Detection Rate : 0.6709
## Detection Prevalence : 0.7903
## Balanced Accuracy : 0.7313
##
## 'Positive' Class : No
##
# Plot
ggplot(data = testing_transf, aes(x = seq_along(Churn))) +
  geom_point(aes(y = post_prob_LR[, "Yes"], color = Churn)) +
  theme_light(base_size = 14) +
  scale_color_manual(values = c("No" = "deepskyblue2", "Yes" = "firebrick2")) +
  xlab("Individual") +
  ylab("Probability of Churn with Logistic regression") +
  geom_hline(yintercept = 0.5, linetype = "dashed")
To conclude, the model's accuracy is \(81.66\)%, so it can predict both retention and churn fairly well. The sensitivity is high, showing that the model is very good at identifying customers who will stay with the company. However, it tends to mispredict the clients who will leave (low specificity). Finally, we can observe that the most costly error occurs \(168\) times.
To incorporate the economic impact of a customer staying with or leaving a Telco company, I will compute the cost for the most accurate model: the logistic regression classifier. Notice that it presents the fewest average errors and the highest accuracy (see the chart below).
Recall (Churn): Yes = the customer left
the company, and No = the customer remained with the company.
When a model makes a wrong prediction, we must take into account that not all errors carry the same financial consequences. For instance, if the model predicts that a customer will stay when they actually leave, this turns into a loss of the expected revenue from that customer. On the other hand, mistakenly predicting that a client will leave when they actually stay does not have a huge impact on the company, but it might lead to a lost opportunity.
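The asymmetry described above can be made concrete with a small helper that weights each cell of a 2×2 confusion matrix by a per-customer gain or loss. The matrix and cell profits below are illustrative placeholders, not the assumptions used later in this section:

```r
# Sketch: average profit per customer from a 2x2 confusion matrix.
# `profits` assigns a per-customer gain/loss to each cell; the values
# here are made up for the example.
profit_per_customer <- function(cm, profits) {
  stopifnot(all(dim(cm) == c(2, 2)))
  sum(profits * cm) / sum(cm)   # element-wise weighting, then average
}

# Rows = predicted (No/Yes churn), columns = actual (No/Yes churn)
cm <- matrix(c(900, 100, 150, 250), nrow = 2, byrow = TRUE,
             dimnames = list(pred = c("No", "Yes"), actual = c("No", "Yes")))
profits <- matrix(c(0.20, -1.0,     # predicted stay: small gain / big loss
                    -0.05, 0.05),   # predicted leave: small loss / small gain
                  nrow = 2, byrow = TRUE)
profit_per_customer(cm, profits)  # 85/1400, about 0.0607
```

Because the stay-but-leaves cell carries a much larger penalty than the others, two models with identical accuracy can yield very different average profits.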
Thus, it is important to interpret the confusion matrix of the chosen model (the logistic regression classifier) correctly. The worst error in the table below is the one committed \(168\) times. The goal is then to reduce this error by changing the probability threshold; the optimal threshold can be selected based on specific economic assumptions.
##
## predictions_LR No Yes
## No 944 168
## Yes 90 205
Economic assumptions used:
If the company predicts a customer will stay and they actually stay, the company gains a 20% profit at the end of the quarter.
If the company predicts a customer will stay but they leave, the loss is 100%.
If the company predicts a customer will leave but they actually stay, there is a 1% loss.
If the company predicts a customer will leave and they actually leave, the profit is 3%.
Using the specified profits and losses for each prediction outcome, we compute the average profit per customer based on the model’s predictions.
# Confusion matrix values (rows = predicted, columns = actual)
TN <- 944  # predicted stay, stayed
FP <- 168  # predicted stay, left (the costly error)
FN <- 90   # predicted leave, stayed
TP <- 205  # predicted leave, left
# Profit/loss per customer for each outcome
profit_TN <- 0.20
loss_FP <- -1.0
loss_FN <- -0.01
profit_TP <- 0.03
# Compute the average profit per client
profit_per_client <- (TN * profit_TN + FP * loss_FP +
                      FN * loss_FN + TP * profit_TP) / (TN + FP + FN + TP)
sprintf("Profit per customer: %f", profit_per_client)
## [1] "Profit per customer: 0.018515"
The goal is to improve this economic output by reducing the most costly mistakes (customers who are predicted to stay but actually switch to another company).
profit_unit <- c(0.20, -0.01, -1.0, 0.03) # column-major over table(pred, actual): [No,No], [Yes,No], [No,Yes], [Yes,Yes]
## Selecting the optimal threshold
profit_i <- matrix(NA, nrow = 50, ncol = 10)
# 50 replicates of the training/validation split for each of the 10 threshold values
p0 <- 0.73  # approximate class prior (not used below)
p1 <- 1 - p0
j <- 0
# Hyper-parameter (threshold) tuning loop
for (threshold in seq(0.05, 0.5, 0.05)){
  j <- j + 1
  cat(j)
  for (i in 1:50){
    # Partition the data into training (40%) and validation (60%) sets
    d <- createDataPartition(training_transf$Churn, p = 0.4, list = FALSE)
    train <- training_transf[d, ]
    test <- training_transf[-d, ]
    # Fit the logistic regression model on the training sample
    full_model <- glm(Churn ~ ., data = train, family = "binomial")
    # Obtain predicted probabilities on the validation set
    probabilities <- predict(full_model, test, type = "response")
    # Convert probabilities to predicted classes based on the threshold
    Churn_pred <- ifelse(probabilities > threshold, "Yes", "No")
    # Create the confusion matrix
    CM <- table(Churn_pred, test$Churn)
    # Calculate the average profit per customer at this threshold
    profit.applicant <- sum(profit_unit * CM) / sum(CM)
    profit_i[i, j] <- profit.applicant
  }
}
## 12345678910
boxplot(profit_i, main = "Hyper-parameter selection",
        ylab = "unit profit", xlab = "threshold",
        names = seq(0.05, 0.5, 0.05), col = "royalblue2")
The spread of the boxes in the boxplot represents the variability of the unit profit across the replicates of the hyper-parameter tuning. A smaller spread means more reliable performance. Therefore, in this situation, the optimal threshold for the Telco company is \(0.05\).
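The visual choice can also be checked programmatically: average the unit profit over the replicates in each column of the tuning matrix and take the threshold with the highest mean. A minimal sketch, using a deterministic stand-in for `profit_i` so the example is reproducible:

```r
thresholds <- seq(0.05, 0.5, 0.05)
# Illustrative stand-in for profit_i (50 replicates x 10 thresholds);
# each column is constant here so the example is deterministic.
profit_demo <- matrix(rep(c(0.05, 0.04, 0.02, 0.01, 0.00,
                            -0.01, -0.02, -0.03, -0.04, -0.05),
                          each = 50), nrow = 50)
# Pick the threshold whose column has the highest mean unit profit
best_threshold <- thresholds[which.max(colMeans(profit_demo))]
best_threshold  # 0.05
```

With the real `profit_i` from the loop above, the same `which.max(colMeans(...))` call replaces the visual inspection of the boxplot.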
Finally, once we have computed the optimal threshold, we can use it to train our model and make predictions on the test set. This will result in a model with worse accuracy, but the economic profit will increase.
## Final prediction for testing set using the optimal hyper-parameter:
best_threshold <- 0.05
# Fit the logistic regression model using the training data
final_model <- train(Churn ~ .,
                     data = training_transf,
                     method = "glm",
                     family = "binomial",
                     trControl = train_ctrl)
# Predict probabilities on the test set
probabilities_post <- predict(final_model, newdata = testing_transf, type = "prob")
# Make final predictions using the best threshold
final_predictions <- ifelse(probabilities_post[, "Yes"] > best_threshold,
                            "Yes", "No")
# Confusion matrix
confusionMatrix(factor(final_predictions), testing_transf$Churn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 376 6
## Yes 658 367
##
## Accuracy : 0.5281
## 95% CI : (0.5016, 0.5544)
## No Information Rate : 0.7349
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.223
##
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.3636
## Specificity : 0.9839
## Pos Pred Value : 0.9843
## Neg Pred Value : 0.3580
## Prevalence : 0.7349
## Detection Rate : 0.2672
## Detection Prevalence : 0.2715
## Balanced Accuracy : 0.6738
##
## 'Positive' Class : No
##
CM <- confusionMatrix(factor(final_predictions), testing_transf$Churn)$table
profit.applicant <- sum(profit_unit*CM)/sum(CM)
sprintf("Profit per customer (with optimal threshold): %f", profit.applicant)
## [1] "Profit per customer (with optimal threshold): 0.052331"
As we initially expected, the company's profit increases, but we end up with a worse model in terms of accuracy. However, given the importance of maximizing benefit, this model, even though it makes more mistakes overall, is much more beneficial for the business. We have been able to reduce the worst errors to \(6\).
This project focuses on analyzing customer churn for a telecommunications company. The dataset contains \(21\) variables such as gender, total charges, and subscription services, among others. The first step consisted of cleaning and manipulating the data, which included deleting variables with low influence and applying some transformations to approach a normal distribution.
The statistical models assessed included Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), Naïve Bayes (NB), and Logistic Regression (LR). Among all models, the logistic regression classifier proved to be the most accurate technique, reaching an accuracy of \(81.66\)%. This good performance can be explained by the model's capacity to handle binary outcomes, making it a solid tool for this dataset.
The last part of the project covers the economic impact of our model based on its prediction errors, demonstrating that not all errors carry the same consequences from the company's point of view. For instance, the “worst” mistake is predicting that a client will keep the company's services when they actually cancel them. By incorporating the economic impact into the model evaluation, it became clear that a slightly less accurate model could yield higher profits by minimizing those “worst” and most costly mistakes.
In conclusion, the project demonstrates the importance of selecting a model not only based on statistical accuracy but also by understanding the economic impact. The right decision may lead us to a less accurate model that maximizes profit, and this can be achieved easily by adjusting the churn-prediction threshold to balance these two objectives.
############ Summary ACCURACY ############
## QDA -----------------------------------------------
# Accuracy: 0.770434 and Worst error: 86
## LDA -----------------------------------------------
# Accuracy: 0.813077 and Worst error: 163
## NB ------------------------------------------------
# Accuracy: 0.7896 and Worst error: 108
## LR ------------------------------------------------
# Accuracy: 0.8166 and Worst error: 168
## Incorporing economic impact -----------------------
# Accuracy: 0.5281 and Worst error: 6
This section contains the code for balancing the dataset. The most promising method is downsampling, which avoids the need to “make up” values by reducing the majority class instead.
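What `sampling = "down"` does inside each resample can be illustrated with a minimal base-R sketch; the `toy` data frame and its column names below are made up for the example:

```r
# Sketch: downsample the majority class to the size of the minority class.
set.seed(42)
toy <- data.frame(x = rnorm(100),
                  Churn = factor(c(rep("No", 80), rep("Yes", 20))))
n_min <- min(table(toy$Churn))  # size of the minority class (20)
# Draw n_min row indices from each class, then combine
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$Churn),
                     function(i) sample(i, n_min)))
balanced <- toy[idx, ]
table(balanced$Churn)  # No: 20, Yes: 20
```

caret performs this step automatically inside every cross-validation fold when `sampling = "down"` is set in `trainControl`, which avoids leaking resampled rows between folds.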
For all these models, we configure a common train control object.
# Solve the unbalanced of my dataset
# Set up train control -> 5 repeats of 10-fold cross validation
# Resolve class imbalances -> sampling = "down"
train_ctrl_balanced <- trainControl(method = "repeatedcv",
                                    repeats = 5,
                                    number = 10,
                                    sampling = "down")
# Train the QDA model using caret
model_QDA_b <- train(Churn ~ .,
                     data = training_transf,
                     method = "qda",
                     trControl = train_ctrl_balanced) # use the balanced train control
print(model_QDA_b)
# Predictions
predictions_QDA_b <- predict(model_QDA_b, X_test)
head(predictions_QDA_b)
# Posterior probabilities
post_prob_QDA_b <- predict(model_QDA_b, X_test, type = "prob")
head(post_prob_QDA_b)
# Performance measures
confMat_QDA_b <- table(predictions_QDA_b, testing_transf$Churn)
addmargins(confMat_QDA_b)
# Error of the model
error_QDA_b <- (n_test - sum(diag(confMat_QDA_b))) / n_test
error_QDA_b
# Ensure the correct train control is used for the balanced LDA model
model_LDA_b <- train(Churn ~ .,
                     data = training_transf,
                     method = "lda",
                     trControl = train_ctrl_balanced) # use the balanced train control
print(model_LDA_b)
# Predictions for the balanced LDA model
predictions_LDA_b <- predict(model_LDA_b, X_test)
head(predictions_LDA_b)
# Posterior probabilities for the balanced LDA model
post_prob_LDA_b <- predict(model_LDA_b, X_test, type = "prob")
head(post_prob_LDA_b)
# Performance measures for the balanced LDA model
confMat_LDA_b <- table(predictions_LDA_b, testing_transf$Churn)
addmargins(confMat_LDA_b)
# Calculate the error rate for the balanced LDA model
error_LDA_b <- (n_test - sum(diag(confMat_LDA_b))) / n_test
error_LDA_b
# Train Naive Bayes model
model_NB_b <- train(x = X_train,
                    y = y_train,
                    method = "naive_bayes",
                    trControl = train_ctrl_balanced,
                    tuneGrid = data.frame(laplace = 0.5,
                                          usekernel = FALSE,
                                          adjust = 1)) # adjust is only used when usekernel = TRUE
# Predict on test data
predictions_NB_b <- predict(model_NB_b, newdata = X_test)
head(predictions_NB_b)
# Posterior probabilities
post_prob_NB_b <- predict(model_NB_b, X_test, type = "prob")
head(post_prob_NB_b)
# Evaluate the model using confusion matrix
confusionMatrix(predictions_NB_b, y_test)
# Performance measures -> without using caret
confMat_NB_b <- table(predictions_NB_b, testing_transf$Churn)
addmargins(confMat_NB_b)
# Error of the model
error_NB_b <- (n_test - sum(diag(confMat_NB_b))) / n_test
error_NB_b
model_LR_b <- train(Churn ~ .,
                    data = training_transf,
                    method = "glm",
                    family = "binomial", # for binary logistic regression
                    trControl = train_ctrl_balanced)
# Print the model summary
print(model_LR_b)
# Predict on test data
predictions_LR_b <- predict(model_LR_b, X_test)
head(predictions_LR_b)
# Posterior probabilities
post_prob_LR_b <- predict(model_LR_b, X_test, type = "prob")
head(post_prob_LR_b)
# Evaluate the model using confusion matrix
confusionMatrix(predictions_LR_b, y_test)
# Performance measures -> without using caret
confMat_LR_b <- table(predictions_LR_b, testing_transf$Churn)
addmargins(confMat_LR_b)
# Error of the model
error_LR_b <- (n_test - sum(diag(confMat_LR_b))) / n_test
error_LR_b
Comparison of the balanced models.
############ Summary ACCURACY ############
## QDA -----------------------------------------------
# Accuracy: 0.7349524
## LDA -----------------------------------------------
# Accuracy: 0.7404545
## NB ------------------------------------------------
# Accuracy: 0.7896
## LR ------------------------------------------------
# Accuracy: 0.7455974